Increasing the Quality and Quantity of Source Language Data for Unsupervised Cross-Lingual POS Tagging

نویسندگان

  • Long Duong
  • Paul Cook
  • Steven Bird
  • Pavel Pecina
چکیده

Bilingual corpora offer a promising bridge between resource-rich and resource-poor languages, enabling the development of natural language processing systems for the latter. English is often selected as the resource-rich language, but another choice might give better performance. In this paper, we consider the task of unsupervised cross-lingual POS tagging, and construct a model that predicts the best source language for a given target language. In experiments on 9 languages, this model improves on using a single fixed source language. We then show that further improvements can be made by combining information from multiple source languages.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

بررسی مقایسه‌ای تأثیر برچسب‌زنی مقولات دستوری بر تجزیه در پردازش خودکار زبان فارسی

In this paper, the role of Part-of-Speech (POS) tagging for parsing in automatic processing of the Persian language is studied. To this end, the impact of the quality of POS tagging as well as the impact of the quantity of information available in the POS tags on parsing are studied. To reach the goals, three parsing scenarios are proposed and compared. In the first scenario, the parser assigns...

متن کامل

Cross-Lingual Transfer Learning for POS Tagging without Cross-Lingual Resources

Training a POS tagging model with crosslingual transfer learning usually requires linguistic knowledge and resources about the relation between the source language and the target language. In this paper, we introduce a cross-lingual transfer learning model for POS tagging without ancillary resources such as parallel corpora. The proposed cross-lingual model utilizes a common BLSTM that enables ...

متن کامل

Empirical Gaussian priors for cross-lingual transfer learning

Sequence model learning algorithms typically maximize log-likelihood minus the norm of the model (or minimize Hamming loss + norm). In cross-lingual part-ofspeech (POS) tagging, our target language training data consists of sequences of sentences with word-by-word labels projected from translations in k languages for which we have labeled data, via word alignments. Our training data is therefor...

متن کامل

An improved joint model: POS tagging and dependency parsing

Dependency parsing is a way of syntactic parsing and a natural language that automatically analyzes the dependency structure of sentences, and the input for each sentence creates a dependency graph. Part-Of-Speech (POS) tagging is a prerequisite for dependency parsing. Generally, dependency parsers do the POS tagging task along with dependency parsing in a pipeline mode. Unfortunately, in pipel...

متن کامل

A language-independent and fully unsupervised approach to lexicon induction and part-of-speech tagging for closely related languages

In this paper, we describe our generic approach for transferring part-of-speech annotations from a resourced language towards an etymologically closely related non-resourced language, without using any bilingual (i.e., parallel) data. We first induce a translation lexicon from monolingual corpora, based on cognate detection followed by cross-lingual contextual similarity. Second, POS informatio...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2013